scale examples we describe in the earlier section “Looking at Levels of Measurement”), data storage
gets even more interesting. First, you have to ask yourself, “Is this variable a Choose only one or
Choose all that apply variable?” The coding is completely different for these two kinds of multiple-
choice variables.
You handle the Choose only one situation just as we describe for Type of Caregiver in the preceding
section: you establish a numeric code for each alternative. For the Likert scale example, if the item
asked about patient satisfaction, you could have a categorical variable called PatSat, with five
possible values: 1 for strongly disagree, 2 for somewhat disagree, 3 for neither agree nor disagree, 4
for somewhat agree, and 5 for strongly agree. And for the Type of Caregiver example, if only one kind
of caregiver is allowed to be chosen from the three choices of nurse, physician, or social worker, you
can have a categorical variable called CaregiverType with three possible values: 1 for nurse, 2 for
physician, and 3 for social worker. Depending upon the study, you may also choose to add a 4 for
other, and a 9 for unknown (9, 99, and 999 are codes conventionally reserved for unknown). If you find
unexpected values in the data, research and document what they mean so that future analysts who
encounter the same data can interpret them.
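One way to put this advice into practice is to keep the coding scheme in a small codebook and check entered values against it. The following is a minimal sketch, not a prescribed method: the dictionary names, the helper find_unexpected, and the sample entries are all made up for illustration, and the codes are the ones described above.

```python
# Hypothetical codebooks for the two "Choose only one" variables in the text.
PATSAT_CODES = {
    1: "strongly disagree",
    2: "somewhat disagree",
    3: "neither agree nor disagree",
    4: "somewhat agree",
    5: "strongly agree",
}
CAREGIVER_TYPE_CODES = {
    1: "nurse",
    2: "physician",
    3: "social worker",
    4: "other",
    9: "unknown",
}


def find_unexpected(values, codebook):
    """Return the sorted list of distinct values not defined in the codebook."""
    return sorted(set(values) - set(codebook))


# Made-up entries for illustration; 7 is not a defined CaregiverType code.
caregiver_entries = [1, 2, 3, 9, 7, 2]
unexpected = find_unexpected(caregiver_entries, CAREGIVER_TYPE_CODES)
print(unexpected)
```

Any value this check reports is a candidate for the research-and-document step: either it is a data-entry error or the codebook is missing a category.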
But the situation is quite different if the variable is Choose all that apply. For the Type of Caregiver
example, if the patient is being served by a team of caregivers, you have to set up your database
differently. Define separate variables in the database (separate columns in Excel) — one for each
possible category value. Imagine that you have three variables called Nurse, Physician, and SW (the
SW stands for social worker). Each variable is a two-value category, also known as a two-state flag,
and is populated as 1 for having the attribute and 0 for not having the attribute. So, if participant 101’s
care team includes only a physician, participant 102’s care team includes a nurse and a physician, and
participant 103’s care team includes a social worker and a physician, the information can be coded as
shown in the following table.
Subject   Nurse   Physician   SW
101       0       1           0
102       1       1           0
103       0       1           1
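The coding in the preceding table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the care_teams dictionary and the to_flags helper are hypothetical names, and the three participants are the ones from the example.

```python
# One flag column per possible category, in the order used in the table.
CATEGORIES = ["Nurse", "Physician", "SW"]

# Care teams for the three example participants (sets of category names).
care_teams = {
    101: {"Physician"},
    102: {"Nurse", "Physician"},
    103: {"SW", "Physician"},
}


def to_flags(team):
    """Code each category as 1 if it is on the care team and 0 if it is not."""
    return {category: int(category in team) for category in CATEGORIES}


# Build one row per participant: the subject ID plus the three flags.
rows = [{"Subject": subject, **to_flags(team)} for subject, team in care_teams.items()]
for row in rows:
    print(row)
```

Because every category gets its own column, each cell always holds exactly one value, 0 or 1, no matter how many caregivers a participant has.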
If you have variables with more than two categories, you can theoretically indicate missing values by
leaving the cell blank, but blanks are difficult to analyze in statistical software. Instead, set up
categories for missing values so that they are part of the coding system (such as using a numerical
code to indicate unknown, refused, or not applicable). The goal is to make sure that for every
categorical variable, a numerical code is entered and no cell is left blank.
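If a data set arrives with blank cells, you can replace them with an explicit code before analysis. The sketch below assumes that 9 is the code your study reserves for unknown; the helper code_or_unknown and the sample column are hypothetical.

```python
# Assumption: this study reserves 9 as the code for "unknown".
UNKNOWN = 9


def code_or_unknown(cell):
    """Return the integer code in a cell, or UNKNOWN if the cell is blank."""
    text = "" if cell is None else str(cell).strip()
    return int(text) if text else UNKNOWN


# Made-up raw column with two blank cells (an empty string and None).
raw_column = ["1", "", "3", None, " 2 "]
coded_column = [code_or_unknown(cell) for cell in raw_column]
print(coded_column)
```

If your study distinguishes unknown, refused, and not applicable, each needs its own code, and a blank cell alone cannot tell you which one applies. That is a good reason to enter the code at data collection rather than fill it in afterward.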
Never try to cram multiple choices into one column! For example, don’t enter 1, 2 into a cell
in the CaregiverType column to indicate the patient has a nurse and physician. If you do, you have
to painstakingly split your single multi-valued column into separate two-state flag columns
(described earlier) before you analyze the data. Why not do it right the first time?
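To show what that cleanup involves, here is a minimal sketch of splitting one multi-valued cell into two-state flags. It assumes the choices are separated by commas and uses the CaregiverType codes from earlier (1 for nurse, 2 for physician, 3 for social worker); the names CODE_TO_COLUMN and split_multi_valued are made up for illustration.

```python
# Map each CaregiverType code to the flag column it belongs in.
CODE_TO_COLUMN = {1: "Nurse", 2: "Physician", 3: "SW"}


def split_multi_valued(cell):
    """Turn a cell such as "1, 2" into one 0/1 flag per category."""
    # Assumption: choices are comma-separated integer codes.
    chosen = {int(part) for part in cell.split(",") if part.strip()}
    return {column: int(code in chosen) for code, column in CODE_TO_COLUMN.items()}


print(split_multi_valued("1, 2"))
```

Real data is rarely this tidy. Cells entered as "1;2", "1 and 2", or "nurse, physician" would each need their own handling, which is why separate flag columns from the start save so much work.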
Recording numerical data